Ask A Manager Survey
INTRODUCTION
In early 2021, the Ask A Manager blog ran its annual salary survey. The survey responses are stored on Google Sheets, making the data accessible to anyone interested. In a previous post, we discussed data pre-processing and the feature selection method. This post focuses on two aspects, namely: 1) reproducibility and 2) out-of-sample testing.
REPRODUCIBILITY
In the previous post, we detailed feature selection by regularised regression, specifically LASSO regression. In this post, we attempt to reproduce those feature selection results.
OUT OF SAMPLE TESTING
In this post, we also follow the traditional machine learning workflow: splitting the data into training and testing samples, fitting a model, cross-validating it, and finally fitting the model on the out-of-sample (testing) dataset. This approach can inform us about the model's parsimony; in other words, can the model perform well on an unseen sample?
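The splitting code itself is not shown in this post; the snippet below is a minimal sketch of that step. The dataframe name Salary_Data, the 80/20 proportion and the seed are illustrative assumptions.

library(tidymodels)

# Illustrative seed so the split is reproducible.
set.seed(2021)
# Assumed 80/20 split of the cleaned survey data.
salary_split <- initial_split(Salary_Data, prop = 0.8)
Salary_Train <- training(salary_split)
Salary_Test <- testing(salary_split)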
LASSO REGRESSION
As mentioned above, the data cleaning process was detailed in an earlier post. It is worth highlighting that the dataset contains several dummy variables indicating various respondent attributes such as job title, industry, city, age group, etc. Our dependent variable is the annual salary. Given the large number of independent variables, LASSO regression is a natural choice, since it shrinks the coefficients of unimportant predictors to exactly zero. In the code below, we fit a LASSO regression model.
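The original fitting code is not reproduced here, so what follows is a minimal sketch of this step with glmnet, assuming the dummy-coded predictors and the outcome column new_annual_salary live in Salary_Train.

library(glmnet)

# Build the predictor matrix and the outcome vector from the training data.
x <- model.matrix(new_annual_salary ~ ., data = Salary_Train)[, -1]
y <- Salary_Train$new_annual_salary

# alpha = 1 selects the LASSO penalty; glmnet fits the whole lambda path.
lasso <- glmnet(x, y, alpha = 1)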
IMPORTANT VARIABLE EXTRACTION
The resulting object is a list of class “glmnet”. Essentially, it contains the coefficients at every step along the lambda path, down to the minimum lambda. As such, there are several sets of results in the object. We are primarily interested in extracting the variables that remain at the minimum lambda, along with their coefficients. The vip package can automate the plotting of these results. However, we want to extract the variable names in order to fit them on our training set. There is probably a package to assist with this step somewhere in the wild, but we haven’t found it yet. Luckily, in R we can write custom functions. The code below details the variable_extractor function.
variable_extractor <- function(a_list){
  # Flag the position of the smallest lambda in the fitted lambda path.
  min_lambda <- data.frame(best_lambda = a_list[["lambda"]] == min(a_list[["lambda"]]))
  min_lambda <- cbind.data.frame(data.frame(index = rownames(min_lambda)),
                                 min_lambda)
  min_lambda <- min_lambda %>%
    filter(best_lambda == TRUE)
  min_lambda <- as.double(unique(min_lambda$index))
  # Extract the full coefficient matrix: one column per lambda value.
  lasso_variables <- as.matrix(coef(a_list)) %>% data.frame()
  lasso_variables <- cbind.data.frame(variables = rownames(lasso_variables),
                                      lasso_variables)
  lasso_variables <- as_tibble(lasso_variables)
  # Keep the variable names plus the coefficients at the smallest lambda.
  lasso_variables <- lasso_variables[, c(1, min_lambda + 1)]
  names(lasso_variables) <- c("variable", "importance")
  # Discard the intercept and any coefficient shrunk to exactly zero.
  lasso_variables <- lasso_variables %>%
    filter(variable != "(Intercept)",
           importance != 0) %>%
    arrange(desc(importance))
  return(lasso_variables)
}

lasso_variable <- variable_extractor(lasso)
The function takes an object of class “glmnet”. First, we find the smallest lambda in the lambda path. Next, we extract all the coefficients from the object and store them as a wide dataframe. The dataframe is then subset to contain only the variable names along with the coefficients at the smallest lambda value. We discard the intercept and any variables whose coefficients equal zero. The final output is identical to the variables extracted by the vip package. Finally, we subset both the training dataset and the testing dataset to contain only the selected independent variables. Since the outcome variable is expressed in USD terms, we log the outcome variable.
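A minimal sketch of these two final steps, assuming the object and column names used throughout this post:

# Variable names retained by the LASSO at the smallest lambda.
selected_vars <- lasso_variable$variable

# Keep only the outcome and the selected predictors in both splits.
Salary_Train <- Salary_Train[, c("new_annual_salary", selected_vars)]
Salary_Test <- Salary_Test[, c("new_annual_salary", selected_vars)]

# Salaries are in USD, so model the outcome on the log scale.
Salary_Train$new_annual_salary <- log(Salary_Train$new_annual_salary)
Salary_Test$new_annual_salary <- log(Salary_Test$new_annual_salary)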
TIDYMODELS: A CLEAN INTERFACE FOR MACHINE LEARNING
The R programming language does not lack methods for running machine learning algorithms; it is, after all, a statistical programming language. The tidymodels metapackage aims to provide a standard interface for modelling and machine learning using tidyverse principles. Below, we use the package to complete a number of steps: cross-fold validation, model specification, fitting on the resamples and, finally, fitting the model on the training dataset.
# 10-fold cross-validation on the training data.
Salary_Folds <- vfold_cv(Salary_Train)
# Plain linear regression via the lm engine.
lm_spec <- linear_reg(engine = "lm")
# Drop near-zero-variance predictors before fitting.
lm_recipe <- recipe(new_annual_salary ~ ., Salary_Train) %>%
  step_nzv(all_predictors())
lm_wf <- workflow(lm_recipe, lm_spec)
# Parallelise the resampling and keep the out-of-fold predictions.
doParallel::registerDoParallel(cores = 10)
ctrl_preds <- control_resamples(save_pred = TRUE)
cv_results <- fit_resamples(lm_wf, Salary_Folds, control = ctrl_preds)
# Final fit on the full training dataset.
lm_wf <- fit(lm_wf, Salary_Train)

# Average the R-squared estimate across the 10 folds.
collect_metrics(cv_results, summarize = FALSE) %>%
  filter(.metric == "rsq") %>%
  summarise(avg_estimate = mean(.estimate),
            .groups = "drop")
# A tibble: 1 × 1
avg_estimate
<dbl>
1 0.268
Across all 10 validation folds, the linear regression model’s R-squared averaged 0.268. It is possible to tune the parameters and use a battery of other machine learning models to improve performance. In the previous post, we utilised a random forest to try and improve the model; despite requiring additional computational power, the increase in performance was marginal to negligible. A sketch of such a comparison follows.
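The sketch below shows how that comparison looks in tidymodels, reusing the recipe, folds and control object defined above; the ranger engine and the number of trees are illustrative assumptions, not the settings from the previous post.

# Random forest specification via the ranger engine (illustrative settings).
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

# Resample the random forest on the same folds for a like-for-like comparison.
rf_wf <- workflow(lm_recipe, rf_spec)
rf_results <- fit_resamples(rf_wf, Salary_Folds, control = ctrl_preds)
collect_metrics(rf_results)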
At this point, we have completed our first objective: we are able to reproduce the LASSO regression results from the previous post. The next objective is to determine the parsimony of the model. Here, we fit the model on the test data.
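A sketch of the out-of-sample refit is below, reusing the recipe and specification from above; extract_fit_engine() pulls out the underlying lm object so that broom::glance() can report the adjusted R-square shown in the test table.

# Refit the same recipe and model specification on the test split.
lm_test_fit <- fit(workflow(lm_recipe, lm_spec), Salary_Test)

# Report fit statistics, including the adjusted R-square.
lm_test_fit %>%
  extract_fit_engine() %>%
  broom::glance()

The published tables themselves appear to be gtsummary-style regression summaries, which can be built from the same extracted lm objects.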
RESULTS
Training Dataset
Linear Regression Model: Training Dataset
Dependent Variable: log(Annual Salary (USD))

Characteristic | Beta (SE)¹,²
---|---
(Intercept) | 11***
years_of_experience_in_field_X21...30.years | 0.43***
job_title_director | 0.32***
years_of_experience_in_field_X11...20.years | 0.30***
industry_government | 0.30***
highest_level_of_education_completed_PhD | 0.26***
city_york | 0.25***
years_of_experience_in_field_X8...10.years | 0.24***
job_title_engineer | 0.23***
industry_tech | 0.27
job_title_senior | 0.15***
years_of_experience_in_field_X5.7.years | 0.15***
industry_health | 0.15
industry_finance | 0.28
how_old_are_you_X35.44 | 0.12***
how_old_are_you_X45.54 | 0.11***
job_title_manager | 0.11***
how_old_are_you_X25.34 | 0.08***
industry_computing | 0.03
highest_level_of_education_completed_Master.s.degree | 0.04***
industry_engineering | 0.04
industry_manufacturing | 0.04
job_title_analyst | 0.04*
years_of_experience_in_field_X2...4.years | 0.04*
gender_Woman | -0.03***
industry_administration | -0.04
industry_care | -0.08
industry_banking | -0.23
race_White | -0.12***
overall_years_of_professional_experience_X11...20.years | -0.14***
overall_years_of_professional_experience_X21...30.years | -0.14***
overall_years_of_professional_experience_X8...10.years | -0.15***
overall_years_of_professional_experience_X5.7.years | -0.18***
industry_education | -0.19***
industry_nonprofits | -0.19***
highest_level_of_education_completed_Some.college | -0.20***
overall_years_of_professional_experience_X2...4.years | -0.22***
industry_public | -0.22***
job_title_assistant | -0.24***
Adjusted R-square: 0.2732

¹ *p<0.05; **p<0.01; ***p<0.001
² SE = Standard Error
Test Dataset
Linear Regression Model: Test Dataset
Dependent Variable: log(Annual Salary (USD))

Characteristic | Beta (SE)¹,²
---|---
(Intercept) | 11***
years_of_experience_in_field_X21...30.years | 0.30***
job_title_director | 0.33***
years_of_experience_in_field_X11...20.years | 0.27***
industry_government | 0.19
highest_level_of_education_completed_PhD | 0.27***
city_york | 0.27***
years_of_experience_in_field_X8...10.years | 0.21***
job_title_engineer | 0.27***
industry_tech | -0.23
job_title_senior | 0.17***
years_of_experience_in_field_X5.7.years | 0.10**
industry_health | 0.15
industry_finance | 0.16
how_old_are_you_X35.44 | 0.07*
how_old_are_you_X45.54 | 0.10**
job_title_manager | 0.14***
how_old_are_you_X25.34 | 0.03
industry_computing | 0.47
highest_level_of_education_completed_Master.s.degree | 0.03*
industry_engineering | 0.26*
industry_manufacturing | -0.18
job_title_analyst | 0.08**
years_of_experience_in_field_X2...4.years | 0.00
gender_Woman | -0.03
industry_administration | 0.14
industry_care | -0.07
industry_banking | -0.08
race_White | -0.10***
overall_years_of_professional_experience_X11...20.years | -0.05
overall_years_of_professional_experience_X21...30.years | -0.05
overall_years_of_professional_experience_X8...10.years | -0.05
overall_years_of_professional_experience_X5.7.years | -0.10*
industry_education | -0.20***
industry_nonprofits | -0.19***
highest_level_of_education_completed_Some.college | -0.26***
overall_years_of_professional_experience_X2...4.years | -0.12**
industry_public | -0.30***
job_title_assistant | -0.17***
Adjusted R-square: 0.2624

¹ *p<0.05; **p<0.01; ***p<0.001
² SE = Standard Error
CONCLUSION
The model performs comparably on the out-of-sample dataset (adjusted R-square of 0.2624 against 0.2732 on the training data), with years of experience in field (21 - 30 years), job title (Director) and highest level of education (PhD) among the most important predictors of annual salary. Provided the questions used in the 2022 survey are similar or identical to those in the 2021 survey, the model above can be evaluated on the 2022 survey results.
REFERENCES
Silge, J. (2021). Fit and predict with tidymodels for #TidyTuesday bird baths in Australia. Available from: https://juliasilge.com/blog/bird-baths/ (accessed 24 December 2021).
Wickham, H., et al. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Firke, S. (2021). janitor: Simple Tools for Examining and Cleaning Dirty Data. R package version 2.1.0. https://CRAN.R-project.org/package=janitor
Kuhn, M., et al. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22. https://www.jstatsoft.org/v33/i01/
Iannone, R., Cheng, J., & Schloerke, B. (2021). gt: Easily Create Presentation-Ready Display Tables. R package version 0.3.1. https://CRAN.R-project.org/package=gt
Mock, T. (2021). gtExtras: A Collection of Helper Functions for the gt Package. R package version 0.2.2.11. https://github.com/jthomasmock/gtExtras
Sjoberg, D.D., Whiting, K., Curry, M., Lavery, J.A., & Larmarange, J. (2021). Reproducible summary tables with the gtsummary package. The R Journal, 13, 570-580. https://doi.org/10.32614/RJ-2021-053